Combine tibbles

*_join(...) is a family of functions for combining two tibbles on common variable(s) called key.

For example, take a group of people (n=7) for which we have collected age and height data:

# height 
df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara"),
           height = c(160, 172, 182, 157, 162)) 
df1
# A tibble: 5 x 2
  name    height
  <chr>    <dbl>
1 Isa        160
2 Jaylinn    172
3 Mila       182
4 Milas      157
5 Yara       162
# age
df2 <- tibble(name = c("Fiene","Jaylinn","Mila","Noah","Yara"),
           age = c(20,24,17,23,17)) 
df2
# A tibble: 5 x 2
  name      age
  <chr>   <dbl>
1 Fiene      20
2 Jaylinn    24
3 Mila       17
4 Noah       23
5 Yara       17

The information here is incomplete, for some we have only height and for some only age. Now we want to combine these tibbles into a single table with variables name, age and height. We can do this with a join function which is able to combine observations (rows) by matching on the common variable(s) between the two tibbles. Matching is done by one or more variables, in this case the variable name.

Keep in mind that for a proper joining of data, the matched variable(s), also known as the key, must be unique in both tibbles.

Join : single variable

You may control the layout of the combined table by choosing one of the three join function laid out below.

inner_join

Return all rows in df1 where there are matching values of name in df2 and all columns in both df1 and df2:

inner_join(df1,df2, by = "name")
# A tibble: 3 x 3
  name    height   age
  <chr>    <dbl> <dbl>
1 Jaylinn    172    24
2 Mila       182    17
3 Yara       162    17

left_join

Return all rows from df1 and all columns in both df1 and df2, NA for missing values in df2:

left_join(df1,df2, by = "name")
# A tibble: 5 x 3
  name    height   age
  <chr>    <dbl> <dbl>
1 Isa        160    NA
2 Jaylinn    172    24
3 Mila       182    17
4 Milas      157    NA
5 Yara       162    17

full_join

All rows and all columns in df1 and df2.

full_join(df1,df2, by = "name")
# A tibble: 7 x 3
  name    height   age
  <chr>    <dbl> <dbl>
1 Isa        160    NA
2 Jaylinn    172    24
3 Mila       182    17
4 Milas      157    NA
5 Yara       162    17
6 Fiene       NA    20
7 Noah        NA    23

Non-unique keys

If the keys are not unique in either of the tibbles then the result will be ambiguous. Take for example df1 defined above, we now add another observation with name Yara and height 168:

df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara", "Yara"),
           height = c(160, 172, 182, 157, 162, 168))
df1
# A tibble: 6 x 2
  name    height
  <chr>    <dbl>
1 Isa        160
2 Jaylinn    172
3 Mila       182
4 Milas      157
5 Yara       162
6 Yara       168

This will make the key, variable name, non-unique. Notice the result of a full join:

full_join(df1, df2, "name")
# A tibble: 8 x 3
  name    height   age
  <chr>    <dbl> <dbl>
1 Isa        160    NA
2 Jaylinn    172    24
3 Mila       182    17
4 Milas      157    NA
5 Yara       162    17
6 Yara       168    17
7 Fiene       NA    20
8 Noah        NA    23

As you can see the join found two matches for Yara in df1, so both are taken and the age for Yara in df2 is replicated.

Use only unique key in joins to avoid ambiguous results.

What if we update df2 also with name Yara with age 19 and apply a join?

df2 <- tibble(name = c("Fiene","Jaylinn","Mila","Noah","Yara", "Yara"),
           age = c(20,24,17,23,17,19)) 
df2
# A tibble: 6 x 2
  name      age
  <chr>   <dbl>
1 Fiene      20
2 Jaylinn    24
3 Mila       17
4 Noah       23
5 Yara       17
6 Yara       19
full_join(df1,df2,"name")
# A tibble: 10 x 3
   name    height   age
   <chr>    <dbl> <dbl>
 1 Isa        160    NA
 2 Jaylinn    172    24
 3 Mila       182    17
 4 Milas      157    NA
 5 Yara       162    17
 6 Yara       162    19
 7 Yara       168    17
 8 Yara       168    19
 9 Fiene       NA    20
10 Noah        NA    23

As you can see we have now 4 ambiguous observations for Yara

Variable name conflicts

Take df1 as defined above and df3:

# height (cm)
df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara"),
           height = c(160, 172, 182, 157, 162))

# height (feet)
df3 <-   df1 %>% mutate(height=height/30.48)
df3
# A tibble: 5 x 2
  name    height
  <chr>    <dbl>
1 Isa       5.25
2 Jaylinn   5.64
3 Mila      5.97
4 Milas     5.15
5 Yara      5.31

both of these tibbles have the variable height, after joining by name :

left_join(df1,df3,by="name")
# A tibble: 5 x 3
  name    height.x height.y
  <chr>      <dbl>    <dbl>
1 Isa          160     5.25
2 Jaylinn      172     5.64
3 Mila         182     5.97
4 Milas        157     5.15
5 Yara         162     5.31

we see that left_join distinguishes height from df1 and the height from df3 with suffixes .x and .y respectively.

There are several more join functions not covered here such as right_join (opposite of left_join), semi_join and anti_join (see dplyr cheat sheet).

Unique keys

base::duplicated(…) function

The base function duplicated can be used to find duplicates in a vector or a tibble.

v <- c("Isa","Jaylinn","Mila","Bonnie", "Yara", "Mila", "Isa", "Mila")
v
[1] "Isa"     "Jaylinn" "Mila"    "Bonnie"  "Yara"    "Mila"    "Isa"     "Mila"   

Here Isa and Mila occur two and three times respectively. The function duplicated returns a logical with TRUE at corresponding positions whenever it matches a value that it had already seen from position 1 to n, n being the length of the vector in this case.

duplicated(v)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

duplicated(…) function does not mark the first occurence of the duplicated values with TRUE.

On tibbles the function duplicated can handle multiple variables(columns):

# the two observations on Yara are identical
df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara", "Yara"),
           height = c(160, 172, 182, 157, 162, 162))
df1
# A tibble: 6 x 2
  name    height
  <chr>    <dbl>
1 Isa        160
2 Jaylinn    172
3 Mila       182
4 Milas      157
5 Yara       162
6 Yara       162
duplicated(df1)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE

We can use this logical vector to filter observations from df1:

df1 %>%  filter(duplicated(df1))
# A tibble: 1 x 2
  name  height
  <chr>  <dbl>
1 Yara     162

Note that we only get 1 row as the result and not 2 because of the behaviour of duplicated function which does not mark the first match.

Test uniqueness of a key set

We can now test whether certain combinations of variables can uniquely identify each observation in the dataset. Let us take the variables {name,age} in the pulse dataset:

key1 <- pulse %>%  select(name,age)
key1
# A tibble: 110 x 2
   name        age
   <chr>     <dbl>
 1 Bonnie       18
 2 Melanie      19
 3 Consuelo     18
 4 Travis       18
 5 Lauri        18
 6 George       22
 7 Cherry       20
 8 Francesca    18
 9 Sonja        19
10 Troy         23
# … with 100 more rows

If the duplicated function returns FALSE for all positions in key1 tibble then it means it did not find any duplicates.

# sum over a logical vector counts the number of TRUE values
sum(duplicated(key1)) 
[1] 2

the sum here is non-zero therefore there are duplicated values, i.e. the key set key1 is not unique.

Join : multiple variables as the key

To illustrate let’s split the pulse dataset into two separate tibbles with common variables name and height:

p1 <- pulse  %>% select(name,height,age,gender,year)
p1
# A tibble: 110 x 5
   name      height   age gender  year
   <chr>      <dbl> <dbl> <chr>  <dbl>
 1 Bonnie       173    18 female  1993
 2 Melanie      179    19 female  1993
 3 Consuelo     167    18 female  1993
 4 Travis       195    18 male    1993
 5 Lauri        173    18 female  1993
 6 George       184    22 male    1993
 7 Cherry       162    20 female  1993
 8 Francesca    169    18 female  1993
 9 Sonja        164    19 female  1993
10 Troy         168    23 male    1993
# … with 100 more rows
p2 <- pulse  %>% select(-age,-gender,-year)
p2
# A tibble: 110 x 10
   id     name      height weight smokes alcohol exercise ran   pulse1 pulse2
   <chr>  <chr>      <dbl>  <dbl> <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl>
 1 1993_A Bonnie       173     57 no     yes     moderate sat       86     88
 2 1993_B Melanie      179     58 no     yes     moderate ran       82    150
 3 1993_C Consuelo     167     62 no     yes     high     ran       96    176
 4 1993_D Travis       195     84 no     yes     high     sat       71     73
 5 1993_E Lauri        173     64 no     yes     low      sat       90     88
 6 1993_F George       184     74 no     yes     low      ran       78    141
 7 1993_G Cherry       162     57 no     yes     moderate sat       68     72
 8 1993_H Francesca    169     55 no     yes     moderate sat       71     77
 9 1993_I Sonja        164     56 no     yes     high     sat       68     68
10 1993_J Troy         168     60 no     yes     moderate ran       88    150
# … with 100 more rows

p1 has the common variables {name,height} (the key) and additional variables {age,gender,year}, and p2 has the same key variables {name,height} with some other measurement variables such as pulse1 pulse2 etc.

Is the combination {name,height} a valid choice as the unique key to join two tibbles p1 and p2?

Yes. This because the combination is unique in the table. You can check the uniqueness of {name,height} with the command duplicated, see also ?base::duplicated. The statement sum(duplicated(pulse %>% select(name,height)))==0 tests whether the number of duplicates is equal to 0.


Now we want to join them back together using full_join:

full_join(p1,p2) # common variables are used when no variable is specified. 
Joining, by = c("name", "height")
# A tibble: 110 x 13
   name      height   age gender  year id     weight smokes alcohol exercise ran   pulse1 pulse2
   <chr>      <dbl> <dbl> <chr>  <dbl> <chr>   <dbl> <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl>
 1 Bonnie       173    18 female  1993 1993_A     57 no     yes     moderate sat       86     88
 2 Melanie      179    19 female  1993 1993_B     58 no     yes     moderate ran       82    150
 3 Consuelo     167    18 female  1993 1993_C     62 no     yes     high     ran       96    176
 4 Travis       195    18 male    1993 1993_D     84 no     yes     high     sat       71     73
 5 Lauri        173    18 female  1993 1993_E     64 no     yes     low      sat       90     88
 6 George       184    22 male    1993 1993_F     74 no     yes     low      ran       78    141
 7 Cherry       162    20 female  1993 1993_G     57 no     yes     moderate sat       68     72
 8 Francesca    169    18 female  1993 1993_H     55 no     yes     moderate sat       71     77
 9 Sonja        164    19 female  1993 1993_I     56 no     yes     high     sat       68     68
10 Troy         168    23 male    1993 1993_J     60 no     yes     moderate ran       88    150
# … with 100 more rows

It is a better practice to explicitly specify the variables:

pulse_joined <- full_join(p1,p2, by = c("name","height"))
pulse_joined
# A tibble: 110 x 13
   name      height   age gender  year id     weight smokes alcohol exercise ran   pulse1 pulse2
   <chr>      <dbl> <dbl> <chr>  <dbl> <chr>   <dbl> <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl>
 1 Bonnie       173    18 female  1993 1993_A     57 no     yes     moderate sat       86     88
 2 Melanie      179    19 female  1993 1993_B     58 no     yes     moderate ran       82    150
 3 Consuelo     167    18 female  1993 1993_C     62 no     yes     high     ran       96    176
 4 Travis       195    18 male    1993 1993_D     84 no     yes     high     sat       71     73
 5 Lauri        173    18 female  1993 1993_E     64 no     yes     low      sat       90     88
 6 George       184    22 male    1993 1993_F     74 no     yes     low      ran       78    141
 7 Cherry       162    20 female  1993 1993_G     57 no     yes     moderate sat       68     72
 8 Francesca    169    18 female  1993 1993_H     55 no     yes     moderate sat       71     77
 9 Sonja        164    19 female  1993 1993_I     56 no     yes     high     sat       68     68
10 Troy         168    23 male    1993 1993_J     60 no     yes     moderate ran       88    150
# … with 100 more rows

You may check:

all_equal(pulse,pulse_joined) # caution: all_equal is an experimental function
[1] TRUE

⚠️ The function all_equal is an experimental function and may become obsolete!



Copyright © 2021 Biomedical Data Sciences (BDS) | LUMC